On-chip bacterial foraging training in silicon photonic circuits for projection-enabled nonlinear classification

On-chip training remains a challenging issue for photonic devices implementing machine learning algorithms. Most demonstrations only implement inference in photonics for offline-trained neural network models. Moreover, artificial neural networks are among the most widely deployed algorithms, while other machine learning algorithms such as the support vector machine (SVM) remain unexplored in photonics. Here, inspired by SVM, we propose to implement a projection-based classification principle by constructing nonlinear mapping functions in silicon photonic circuits, and we experimentally demonstrate on-chip bacterial foraging training for this principle to realize single Boolean logics, combinational Boolean logics, and Iris classification with ~96.7% to 98.3% accuracy. This approach can offer performance comparable to artificial neural networks on various benchmarks, even at smaller scales and without traditional activation functions, showing a scalability advantage. Natural-intelligence-inspired bacterial foraging offers efficient and robust on-chip training, and this work paves the way for photonic circuits to perform nonlinear classification.

(sinusoidal multiplications), such a direct treatment on the data to generate projection-assisted classification effect cannot be fully covered by the traditional ANN frame. In addition, Supplementary

SVM-like principle's instantiations
SVM is a supervised learning algorithm that seeks a hyperplane with maximum margin in a higher-dimensional projected space to separate data. Taking the case in Supplementary Fig. 1(d) as an example, even though detecting the optical power involves a squared-modulus function, the classification is already done in the multi-dimensional complex vector space, as seen from the distance of the elements of y to the origin. From the equation established in the main text (section Model explanation), all data points have the same distance of 1/||w|| to the plane, which indicates that the margin is maximized and all XOR patterns are support vectors. A similar condition is also satisfied for the case of Supplementary Fig. 1(e). Thus, training with the power as the target is essentially the same as using the distance in complex space. Training the power of y is equivalent to maximizing the distance, i.e., minimizing ||w||^2, which is the same optimization target as used in the SVM algorithm (as shown by Eq. 1.24 in Ref. [3]).

SVM-like kernel description
SVM usually adopts kernel technique to save the computational cost for nonlinear mapping (Refs. [3,4]

Input data for experiment
Here we explain how the input data are prepared. As stated in the manuscript, all input data x (XOR, Iris, etc.) are normalized to the phase in units of π radians. We need the DAC value corresponding to the normalized input data x. Since the phase is linear in the power, we can convert x to a DAC value (= voltage) by interpolating from x·P_π to x_dac using the power-DAC (voltage) curve, as shown in Supplementary Fig. 3(b), where P_π is the π-shift power (measured values are shown in Section 2.3 below). For each element of x, this interpolation uses the power-DAC curve of its corresponding heater. The DAC biases (off-state reference points for interpolation) are included among the training parameters.
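The conversion described above can be sketched as a one-dimensional interpolation. This is only an illustration under stated assumptions: the function name `phase_to_dac`, the 12-bit DAC range, the 500 Ω heater resistance, the 20 mW π-shift power, and the way the DAC bias is turned into a power offset are all hypothetical and are not taken from the measured curves in Supplementary Fig. 3(b).

```python
import numpy as np

def phase_to_dac(x, dac_curve, power_curve, p_pi, dac_bias=0.0):
    """Map a normalized phase x (in units of pi) to a DAC value.

    dac_curve / power_curve: sampled power-DAC curve of one heater,
    both monotonically increasing. dac_bias is the off-state reference
    point (assumed here to act as a power offset).
    """
    bias_power = np.interp(dac_bias, dac_curve, power_curve)
    target_power = np.asarray(x, dtype=float) * p_pi + bias_power
    # Invert the monotonic power-DAC curve by interpolation.
    return np.interp(target_power, power_curve, dac_curve)

# Hypothetical heater: 12-bit DAC spanning 0-5 V, 500 ohm resistance.
dac_curve = np.linspace(0.0, 4095.0, 256)
voltage = 5.0 * dac_curve / 4095.0
power_curve = voltage ** 2 / 500.0 * 1e3   # heater power in mW
p_pi = 20.0                                # assumed pi-shift power in mW

x = np.array([0.0, 0.25, 0.5, 1.0])        # phases in units of pi
x_dac = phase_to_dac(x, dac_curve, power_curve, p_pi)
```

Because the heater power grows quadratically with voltage, equal phase steps map to unequal DAC steps, which is why the interpolation is done per heater on its own curve.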

Phase shifter efficiency
The

Experiment and simulation comparison of BFO training for XOR
In experiment, the algorithm controls the voltage, while in simulation, the algorithm controls the phase.
Supplementary Fig. 5(a) shows another BFO training experiment under different conditions, with even better convergence than that in Fig. 3 in the main text. This indicates that BFO training is not sensitive to the training conditions and has a higher success rate than RMSprop even with a larger voltage step.
The MSE of BFO training is compared between the experiment (from Fig. 3 in the main text) and two simulation cases using 10 (sim 1) and 5 (sim 2) chemotaxis loops in Supplementary Fig. 5(b). After 50 epochs, the experimental MSE is almost the same as that of sim 2, and both are slightly larger than that of sim 1. The simulated optical power maps are shown in Supplementary Fig. 5(c). We have confirmed that BFO is not sensitive to the parameter settings in either simulation or experiment. Whether a constant step (e.g., 0.03) or adaptive steps (e.g., 0.03 or 0.05) were used, correct classification was reached, with only a difference in optical power like that seen in Supplementary Fig. 5(c), corresponding to a small difference in the final MSE. Unlike the simulation, the experiment is affected by noise; thus, we need a slightly larger step at the beginning and smaller ones as training progresses for better convergence, for which we use an adaptive voltage step in the experiment. The simulation uses the exact same structure in  (b) using port 0 (2) and 1 (3) as X = 0 and 1, respectively. X = B(b1, b2). B represents a Boolean logic operation.
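To make the roles of the chemotaxis loops and the step size concrete, the following is a minimal, simplified sketch of a bacterial foraging optimizer acting on a black-box loss. It is not the BFO engine of Supplementary Fig. 26: the function `bfo_minimize`, the greedy acceptance of moves, the population size, the swim length, the dispersal probability, and the toy loss are all assumptions made for illustration. Only the 5 chemotaxis loops, the 0.03 step, and the 50 epochs echo values quoted in the text.

```python
import numpy as np

def bfo_minimize(loss_fn, n_params, n_bacteria=8, n_epochs=50, n_chemotaxis=5,
                 n_swim=3, step=0.03, p_disperse=0.1, seed=0):
    """Simplified BFO; n_bacteria is assumed to be even."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(0.0, 2.0 * np.pi, size=(n_bacteria, n_params))
    cost = np.array([loss_fn(p) for p in pop])
    best, best_cost = pop[np.argmin(cost)].copy(), float(cost.min())
    history = [best_cost]
    for _ in range(n_epochs):
        health = np.zeros(n_bacteria)
        for _ in range(n_chemotaxis):
            for i in range(n_bacteria):
                # Tumble: pick a random unit direction.
                direction = rng.normal(size=n_params)
                direction /= np.linalg.norm(direction)
                # Swim: keep moving while the loss improves.
                for _ in range(n_swim):
                    trial = pop[i] + step * direction
                    trial_cost = loss_fn(trial)
                    if trial_cost >= cost[i]:
                        break
                    pop[i], cost[i] = trial, trial_cost
                health[i] += cost[i]
                if cost[i] < best_cost:
                    best, best_cost = pop[i].copy(), float(cost[i])
        # Reproduction: the healthier half (lower accumulated loss) splits.
        keep = np.argsort(health)[: n_bacteria // 2]
        pop = np.concatenate([pop[keep], pop[keep]])
        cost = np.concatenate([cost[keep], cost[keep]])
        # Elimination-dispersal: occasionally relocate a bacterium at random.
        for i in range(n_bacteria):
            if rng.random() < p_disperse:
                pop[i] = rng.uniform(0.0, 2.0 * np.pi, size=n_params)
                cost[i] = loss_fn(pop[i])
                if cost[i] < best_cost:
                    best, best_cost = pop[i].copy(), float(cost[i])
        history.append(best_cost)
    return best, best_cost, history

# Toy black-box "device": four phases, target output powers (hypothetical).
target = np.array([1.0, 0.0, 0.0, 1.0])

def toy_device_loss(phases):
    return float(np.mean((np.cos(phases / 2.0) ** 2 - target) ** 2))

best, best_cost, history = bfo_minimize(toy_device_loss, n_params=4)
```

The optimizer only queries `loss_fn`, so the same loop could in principle drive phases in a simulation or voltages on a chip, which mirrors the black-box treatment described in the text.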

Single and combinational 2-bit logic operation (simulation)
Besides XOR, various Boolean logic operations can be performed in our device. As shown in Supplementary Fig. 8, by reconfiguring the phase weights with BFO, we can achieve AND, OR, and NAND.
In real applications, the differential operation between two target ports can be adopted. More importantly, combinational logics such as XOR-AND and OR-NAND can be implemented simultaneously in this single device by assigning two ports to the logic values 0 and 1 of each operation. The port assignment is not unique, as seen from Supplementary Figs. 8(a) and 8(b). The XOR-AND logic works as a half adder, in which port 3 in Supplementary Fig. 8(a) (or port 1 in Supplementary Fig. 8(b)) outputs the sum and port 7 (or port 3) outputs the carry. The realization of multiple logics in one photonic device could be used for advanced optical computing.
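As a small illustration of the XOR-AND combination acting as a half adder, the sketch below builds target port patterns for the four 2-bit inputs. The ports for the logic value 1 (port 3 for the sum, port 7 for the carry) follow Supplementary Fig. 8(a) as described above; the ports chosen for the logic value 0 (ports 1 and 5), the 8-port output, and the unit target power are assumptions for illustration only.

```python
import numpy as np

N_PORTS = 8  # assumed number of output ports
# Port carrying the light for each logic value of each operation.
# The "1" ports follow the text; the "0" ports are hypothetical.
PORTS = {"sum": {0: 1, 1: 3}, "carry": {0: 5, 1: 7}}

def half_adder_target(b1, b2):
    """Return (sum bit, carry bit, target port pattern) for inputs b1, b2."""
    s, c = b1 ^ b2, b1 & b2            # XOR gives the sum, AND gives the carry
    target = np.zeros(N_PORTS)
    target[PORTS["sum"][s]] = 1.0
    target[PORTS["carry"][c]] = 1.0
    return s, c, target

table = {(b1, b2): half_adder_target(b1, b2) for b1 in (0, 1) for b2 in (0, 1)}
```

Each input pattern lights exactly one port per logic operation, so both operations can be read out from the same device in a single pass.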

Single and combinational 2-bit logic operation with drop out
As mentioned in the main text, the device uses an MZI-mesh-based interferometer circuit (Clements' topology in Ref. [5]) to perform the matrix transformation of the projected vector. For an N×M (M<N) matrix transformation, when we use all ports in training, we impose an additional condition of energy conservation on this matrix transformation. This condition is automatically satisfied by a unitary transformation, but it is not guaranteed for an arbitrary matrix transformation. Thus, the matrix configured using all ports uses an N×N unitary transformation to approximate an N×M arbitrary matrix transformation. This approximation is sufficient for simple classification tasks, but it slightly decreases the accuracy for complex nonlinear classification. Therefore, discarding some ports (drop out) is necessary to further enhance classification performance. Drop out was also used in photonic neural networks, as seen in Refs. [6,7]. In convolutional neural network programs, random drop out is usually adopted to suppress over-fitting and to improve model generalization. We examine the effect of drop out for bit pattern recognition. As shown in Supplementary Fig. 9, the bit contrast can be improved compared to using all ports, but at the cost of sacrificing some optical power.
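The energy-conservation argument can be checked numerically with a random unitary matrix. The sketch below is only an illustration: the 8-port size, the random unitary generated by QR decomposition, and the kept ports (1, 3, 5, as used later for Iris with drop out) are assumptions and do not represent a trained device.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ports = 8  # assumed mesh size

# Random N x N unitary as a stand-in for a configured MZI mesh.
a = rng.normal(size=(n_ports, n_ports)) + 1j * rng.normal(size=(n_ports, n_ports))
unitary, _ = np.linalg.qr(a)

# Unit-power input field.
x = rng.normal(size=n_ports) + 1j * rng.normal(size=n_ports)
x /= np.linalg.norm(x)

# Using all ports: the output powers always sum to the input power.
power_all = np.abs(unitary @ x) ** 2

# Drop out: keep only some output ports, i.e. a rectangular sub-matrix.
kept_ports = [1, 3, 5]
sub_matrix = unitary[kept_ports, :]
power_kept = np.abs(sub_matrix @ x) ** 2
```

The rows kept after drop out are no longer bound by the sum-to-one constraint, so training has more freedom in shaping the kept outputs; the discarded power is the cost noted in the text.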

Experiment and simulation comparison for Iris classification
Here we compare the experimental and simulated results for each algorithm, BFO and RMSprop.

Solution of voltage weights and power consumption: BFO vs RMSprop
Here

An example of ANN for Iris classification
An ANN model was programmed in PyTorch (https://pytorch.org/), consisting of two linear layers and a ReLU activation layer. The output is given by the Softmax function. The training curve of accuracy is shown in Supplementary Fig. 12, showing a 96.67% (max) verification accuracy for the same test set used in our on-chip training experiment above.
Supplementary Fig. 12: A PyTorch-based class used to construct a 4×5×3 ANN model, and its training curve of accuracy in classifying the Iris dataset using the same RMSprop optimizer with a learning rate of 0.005.
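For readers without access to the figure, a model of this kind might look like the sketch below. It is not the class shown in Supplementary Fig. 12: the class name `IrisANN`, the 4-5-3 layer sizes (4 inputs, 5 hidden units, 3 outputs), the MSE loss, and the random stand-in data are assumptions. Only the two linear layers, the ReLU, the Softmax output, and the RMSprop optimizer with a learning rate of 0.005 come from the text.

```python
import torch
from torch import nn

class IrisANN(nn.Module):
    """Two linear layers with a ReLU in between and a Softmax output."""

    def __init__(self, n_in=4, n_hidden=5, n_out=3):
        super().__init__()
        self.fc1 = nn.Linear(n_in, n_hidden)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(n_hidden, n_out)

    def forward(self, x):
        return torch.softmax(self.fc2(self.act(self.fc1(x))), dim=1)

torch.manual_seed(0)
model = IrisANN()
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.005)
loss_fn = nn.MSELoss()  # assumed loss; the source does not state it here

# Random stand-in batch (8 samples, 4 features); real Iris data would go here.
features = torch.rand(8, 4)
labels = torch.randint(0, 3, (8,))
targets = nn.functional.one_hot(labels, num_classes=3).float()

for _ in range(5):  # a few illustrative training steps
    optimizer.zero_grad()
    loss = loss_fn(model(features), targets)
    loss.backward()
    optimizer.step()

n_params = sum(p.numel() for p in model.parameters())
```

Under these assumed layer sizes the model has 43 trainable parameters (weights and biases).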

Experimental robustness analysis for Iris classification
After training, we obtained the voltage weight at each heater (corresponding to Fig. 5 in the main text; see Supplementary Fig. 11). The learned voltage weights can be loaded into the device at any time for repeated verification. We then intentionally introduced a random bias error to the weights and examined the resulting accuracy and MSE, as shown in Supplementary Fig. 13. When the bias error is <3%, there is no obvious degradation in accuracy for either BFO or RMSprop. For bias errors >3% and <7%, BFO appears to give more robust training results than RMSprop, since it shows a slower decrease in accuracy and a slower increase in MSE. Further increasing the bias error beyond 7% induces large random variations in accuracy and a quick increase in MSE. Therefore, a bias control precision of <3% is required to avoid accuracy degradation. Supplementary Fig. 14 shows the confusion matrices of Iris classification at 3% and 7% bias errors on the BFO curves in Supplementary Fig. 13(a). At the 3% bias error, labels 1 and 2 are misrecognized. When the error is increased to 7%, however, label 0 (= Setosa) appears to be affected more strongly than labels 1 (= Versicolor) and 2 (= Virginica).
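A robustness sweep of this kind can be sketched as follows. The sketch is illustrative only: modeling the bias error as a uniform relative perturbation of each weight, the helper names `apply_bias_error` and `sweep_bias_error`, the number of trials, and the toy evaluation function are assumptions; in the experiment the evaluation is a measurement on the chip.

```python
import numpy as np

def apply_bias_error(weights, max_error, rng):
    """Perturb each weight by a random relative error within +/- max_error."""
    noise = rng.uniform(-max_error, max_error, size=weights.shape)
    return weights * (1.0 + noise)

def sweep_bias_error(evaluate, weights, error_levels, n_trials=20, seed=0):
    """Average evaluate(perturbed weights) over random trials per error level."""
    rng = np.random.default_rng(seed)
    results = []
    for err in error_levels:
        scores = [evaluate(apply_bias_error(weights, err, rng))
                  for _ in range(n_trials)]
        results.append(float(np.mean(scores)))
    return results

# Toy stand-in for the device: output powers depend on the weights, and the
# MSE is taken against the outputs at the nominal (learned) weights.
nominal = np.array([1.2, 0.8, 2.1, 1.5])  # hypothetical learned weights

def toy_mse(weights):
    return float(np.mean((np.cos(weights) ** 2 - np.cos(nominal) ** 2) ** 2))

levels = [0.0, 0.03, 0.07]
mse_curve = sweep_bias_error(toy_mse, nominal, levels)
```

Replacing `toy_mse` with a function that loads the perturbed weights and returns the measured accuracy or MSE would give curves of the kind plotted against bias error.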

On-chip training experiment with drop out for Iris classification
We repeated the same procedure (on-chip training and robustness analysis) as above for Iris classification with drop out (using only ports 1, 3, and 5 for the output vector and discarding the other ports, as explained in the main text and in Supplementary Section 3.4 above). The experimental training curves are shown in Supplementary Fig. 15(a). After training, the accuracies were verified to be ~98.9% and ~98.3% for the train and test sets, respectively, as seen from the confusion matrices in Supplementary Fig. 15(b). On average, the accuracy is 98.7% for all 150 samples, and this experimental value is comparable to that (97.3%) of complex photonic neural networks in Ref. [8]. Next, we performed a robustness analysis in the same way as in Supplementary Fig. 13(a). Supplementary Fig. 15(c) shows the accuracy degradation induced by the bias error. Compared to the case without drop out, the accuracy is more sensitive to the bias error, with a deviation of >1% causing a rapid decrease in accuracy. Therefore, with drop out, more precise voltage control is required. The experimental accuracies are consistent with the simulated ones in Supplementary Fig. 15(d) (see also Supplementary Fig. 24(c)).

Reproducibility and long-time stability
Reproducibility includes two aspects: sample reproducibility and measurement reproducibility. We selected another chip from the same wafer and packaged it by wire bonding. Using this wire-bonded chip, we repeated all experiments previously done with 40-pin probes and obtained reproducible experimental results in this revised manuscript, as shown below.

Long-time stability of Iris classification results
After BFO training, the learned voltage weights were repeatedly re-loaded into the device over several days, and verification was performed for the samples in both the train and test sets. As shown in Supplementary were checked by repeatedly re-loading the voltage weights (in Supplementary Fig. 11) learned by BFO.

Long-time stability of port reconfiguration experiment
The MSE in Supplementary Section 5.1 reflects the long-time stability of the optical power distribution among all ports for the Iris classification. Here, we show the stability of each individual port after the light is reconfigured to that single port by BFO. For example, as shown in Supplementary Fig. 17(a), we can automatically reconfigure the device to maximize the optical power at ports 5 and 7. The start and end curves are those before and after training. We did this for all ports and measured the MSE, as shown in Supplementary Fig. 17(b). For each port, we re-loaded the obtained voltage weights into the device and checked the MSE variations. We did not observe any degradation in MSE within one week of verification for any port, indicating the control reproducibility and device stability, as shown in the right panel of Supplementary Fig. 17(b).
Supplementary Fig. 17: (a) Automatic reconfiguration to route the light to ports 5 and 7 by BFO. The start curve is before training and the end curve is after 20 epochs of training. (b) Residual MSE for each port and its repeated measurement results within one week.

Measurement environment and device structure
We performed all experiments in a room-temperature lab environment without any thermal management, as seen in Supplementary Fig. 18. In the previous experiment, we used 40-pin probes to contact the pads; in the current experiment, we used a wire-bonded chip instead, which offers much better stability. As seen from the cross-section in Supplementary Fig. 18, thermal equilibrium is mainly established at the local surface since the cap layer is thin. Thus, inter-heater thermal interference is small, as in our previous large-scale optical switches in Ref. [9]. More

Robustness to device imperfection
We examine the influence of device imperfection (e.g., fabrication error) on reproducibility by simulation. In silicon photonics, one of the most sensitive components is the directional coupler (DC) (one MZI has two DCs). For switching applications, the DC is required to be as close to 3 dB as possible to guarantee low crosstalk, as seen in Refs. [9,10]. For the classification application, to clarify the influence of DC errors, we introduce a maximum deviation (10%), as seen in the equation in Supplementary Fig. 19, which denotes the deviation ratio from π/4. This deviation is randomly generated for all DCs. Taking XOR as an example, as shown in Supplementary Fig. 19, even with an error, the final MSE in training is almost the same as that without error. Increasing the error does not monotonically increase the MSE, and the training is still successful even with errors. As seen from the optical propagation inside the device, the final XOR ports (port 1 denotes 0, port 5 denotes 1) can also be correctly trained, showing almost the same separation result. This is different from the switching application: for switching, the light always goes along a single path that is strongly influenced by the DC error in each MZI along the path, whereas for classification, the light undergoes multi-path interference. Even with errors inside the device, the training finds a different multi-path interference that conveys the same classification information at the output. This can be seen from the light propagation in Supplementary Figs. 19(a) and 19(b), in which the internal interference paths differ with and without errors while the final classification results are the same. For Iris classification, the final MSE even becomes smaller when random errors are included, as seen in Supplementary Fig. 19(c).
Therefore, the classification application is more robust to fabrication errors, and the reproducibility of classification devices is in principle higher than that of other silicon photonic devices consisting of DCs, such as switches and micro-rings.
BFO was used to obtain the results in Supplementary Fig. 19.
Supplementary Fig. 19: The optical power mapping shows the simulated optical propagation inside the device for XOR separation (a) without the DC error and (b) with a DC error of 0.1, where the DC error is the deviation ratio of the directional coupler (DC) defined in the equation. (c) Comparison of the MSE with and without the DC error for XOR and Iris.
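The contrast between switching and classification can be illustrated with a single MZI built from two imperfect DCs. The sketch below is an assumption-laden illustration, not the simulation behind Supplementary Fig. 19: the DC is modeled by a coupling angle of (π/4)(1 + d), with d drawn uniformly within ±10%, and the function names are hypothetical.

```python
import numpy as np

def dc_matrix(deviation=0.0):
    """Directional coupler with coupling angle (pi/4) * (1 + deviation)."""
    theta = (np.pi / 4.0) * (1.0 + deviation)
    return np.array([[np.cos(theta), 1j * np.sin(theta)],
                     [1j * np.sin(theta), np.cos(theta)]])

def mzi_matrix(phase, dev1=0.0, dev2=0.0):
    """MZI = DC, inner phase shifter on the upper arm, DC."""
    shifter = np.diag([np.exp(1j * phase), 1.0])
    return dc_matrix(dev2) @ shifter @ dc_matrix(dev1)

rng = np.random.default_rng(0)
max_dev = 0.10
dev1, dev2 = rng.uniform(-max_dev, max_dev, size=2)

# Best achievable extinction at the bar port when sweeping the inner phase.
phases = np.linspace(0.0, 2.0 * np.pi, 721)
ideal_leak = min(abs(mzi_matrix(p)[0, 0]) ** 2 for p in phases)
real_leak = min(abs(mzi_matrix(p, dev1, dev2)[0, 0]) ** 2 for p in phases)

# The imperfect MZI is still unitary: power is redistributed, not lost.
t = mzi_matrix(0.7, dev1, dev2)
```

With ideal couplers the leakage can be tuned to zero, while with imperfect couplers a residual leakage of sin²((π/4)(d1 + d2)) remains for any phase. A switch suffers this crosstalk directly, whereas a trained classifier only needs the overall interference pattern, which remains unitary and can be re-learned.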

Scalability examination by MNIST simulation
Here, we investigate the scalability of accuracy and power of the projection-based method by simulating MNIST (handwritten digit dataset [11]) classification. This simulation classifies the k-space patterns of MNIST images obtained by FFT preprocessing [6,7]. We prepared four architectures, G1-G4, in Supplementary Fig. 20 for this case. Since the data extracted from the k-space patterns are complex values, they are input via a 1×1 MZI with an inner phase shifter and an external phase shifter, whose phases are related to the k-space data x as arcsin(|x|) for the inner phase and arcsin(real(x)/|x|) for the external phase (with a minus sign when imag(x) < 0).
Thus, the architecture G1 is linear in the complex space and can be treated as a reference. We use the architectures G2-G4 to form a nonlinear mapping in the complex space by cascading the data input twice with an intermediate VMM. To explain the mapping functions of these schemes, we take a two-element vector x = (x1, x2) as an example. Here we omit coefficients for simplicity, so that the projected vector is x' = (x1(x1+x2), x2(x1+x2)). For the dot product with another projected vector v', x'·v' = (x·v)(x·v + x^T A v), where A = (0 1; 1 0). Thus, these mapping functions are quadratic-like in the complex space. The one-hot optical power vector y10 is extracted from the output ports 0-9, and argmax(y10) is used to mark the correct label. The training uses the normalized y10/‖y10‖ to calculate the MSE loss, which offers faster convergence than other normalization methods and loss functions [7]. We used 500 images for each digit (loading images from the MNIST data file train-  In Supplementary Fig. 20, the total parameters (weights) are indicated by summing the phase shifters in the order of Splitter + PS + VMM + PS + VMM. The scalability is examined by increasing the input parameters and scale, within an achievable range for current silicon photonic platforms. The maximum-scale VMM consists of 16×(32+31) = 1008 MZIs, which is achievable as seen in [9,12].
RMSprop was adopted here for training, for comparison with ANN.
Supplementary Fig. 20: Four architectures to generate mapping functions for classifying the FFT-processed k-space patterns of MNIST. G1-G3 take 16 components for input and G4 takes 32 components for input.
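The input encoding, the two-element quadratic-like mapping, and the normalized MSE loss described in this section can be checked numerically. The sketch below is illustrative: the function names are hypothetical, the inputs are assumed to be scaled so that |x| ≤ 1, and real-valued vectors with unit coefficients are used for the mapping check. Expanding the dot product of two projected vectors gives (x·v)(x·v + x^T A v) with A the 2×2 exchange matrix, which is what the code verifies.

```python
import numpy as np

def encode_kspace(x):
    """Inner and external phases for complex k-space data x with |x| <= 1."""
    x = np.asarray(x, dtype=complex)
    mag = np.abs(x)
    inner = np.arcsin(np.clip(mag, 0.0, 1.0))
    ratio = np.divide(x.real, mag, out=np.zeros_like(mag), where=mag > 0)
    external = np.arcsin(np.clip(ratio, -1.0, 1.0))
    external = np.where(x.imag < 0, -external, external)  # minus sign rule
    return inner, external

def project(x):
    """Two-stage mapping with coefficients omitted: x' = x * (sum of x)."""
    x = np.asarray(x)
    return x * x.sum()

def normalized_mse(y, label):
    """MSE between y/||y|| and a one-hot target for the given label."""
    y = np.asarray(y, dtype=float)
    target = np.zeros_like(y)
    target[label] = 1.0
    return float(np.mean((y / np.linalg.norm(y) - target) ** 2))

# Check the quadratic-like kernel identity on a two-element example.
x = np.array([0.3, -0.7])
v = np.array([0.5, 0.2])
exchange = np.array([[0.0, 1.0], [1.0, 0.0]])
lhs = project(x) @ project(v)
rhs = (x @ v) * (x @ v + x @ exchange @ v)

inner, external = encode_kspace([0.5 + 0.0j, 0.3 - 0.4j])
```

The normalization in `normalized_mse` makes the loss depend only on the distribution of power among the ten ports, not on the total detected power.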

Scalability of accuracy and power
Supplementary Fig. 21(a) shows the accuracies of the four architectures in Supplementary Fig. 20. G1 is a linear mapping since x is in the complex domain; thus, its accuracy (~86%) is close to that (~85%) of the linear photonic ANN [6]. From G1 to G3, the accuracy can be enhanced to 90-91% owing to the projection effect. This is a benchmark of >90% accuracy with only <900 parameters and without the traditional nonlinear activation functions used in ANN. For G4, ~96.6% training accuracy and ~94% testing accuracy can be achieved with 1663 parameters. For ANN [6,8], it is known that such an accuracy cannot be achieved without nonlinear activation functions, evidencing a principle different from that of ANN.
For 16 and 32 input components, ~5% and ~4% accuracy enhancement can be achieved, respectively, by implementing the nonlinear projection in G2-G4 compared to the linear one in G1 (as seen by comparing Supplementary Fig. 21(a) and Fig. 22). Quantitatively, the accuracy scales with the total parameter number N as log(N^0.13) for training and log(N^0.09) for testing.

Influence of MZI layer number of MZI-mesh-based VMM
The VMM adopts Clements' topology [5], as seen in Fig. 1a in the main text. Adding more duplicate MZI layers (columns) over-parameterizes the matrix transformation and evidently will not degrade the performance. The minimum required number of MZI layers is investigated for MNIST classification using the structure G1. As shown in Supplementary Fig. 22, the accuracy first increases with the layer number and then becomes almost constant after reaching its maximum, which means that further increasing the layer number no longer contributes to accuracy enhancement once the layer number is sufficient. By comparing the results of G3 and G4 in Supplementary Fig. 21(a) with the saturated maximum accuracies in Supplementary Figs. 22(a) and 22(b), respectively, we can see that the nonlinear projection contributes ~4-5% accuracy enhancement.
Supplementary Fig. 22: Accuracy in relation to the MZI layer number (column number) of the VMM for MNIST classification using: (a) 16-component input and (b) 32-component input. The column number here is counted as indicated in Fig. 1a in the main text, including two MZI sets (the column number = 2M, where M is used in Supplementary Fig. 20). Total parameters of the entire structure are also indicated on the top axis.
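The relation between the column number and the MZI count in a Clements-type mesh can be written as a short helper. This is a sketch under assumptions: the function `mzi_count` is hypothetical, columns are assumed to alternate between floor(N/2) and floor((N-1)/2) MZIs, and the 64-port, 32-column example is an inferred configuration chosen because it reproduces the 16×(32+31) = 1008 MZIs quoted above for the maximum-scale VMM.

```python
def mzi_count(n_ports, n_columns):
    """Number of MZIs in a Clements-type mesh truncated to n_columns columns.

    Columns alternate between n_ports // 2 MZIs (first, third, ...) and
    (n_ports - 1) // 2 MZIs (second, fourth, ...).
    """
    first_type = n_ports // 2
    second_type = (n_ports - 1) // 2
    n_first = (n_columns + 1) // 2
    n_second = n_columns // 2
    return n_first * first_type + n_second * second_type

# Assumed configuration: 64 ports, column number 2M = 32 (M = 16).
max_scale = mzi_count(64, 32)

# A full N-column mesh gives the N(N-1)/2 MZIs of a universal unitary.
full_8 = mzi_count(8, 8)
```

Without external phase shifters, each MZI contributes one trainable phase, so this count is also the number of VMM parameters swept along the top axis of such plots.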

With or without external phase shifters for MZI
Adding more parameters, such as increasing the layer number (see Supplementary Section 6.2) or adding an external phase shifter to each MZI (see the schematic in Supplementary Fig. 23), evidently will not degrade the classification performance, but it increases the total number of training parameters and hence the training time. In all the above simulations, no external phase shifters were used, and the obtained training results are the specific solutions with the external phase fixed at 0. Here we compare the classification results with and without external phase shifters. After adding external phase shifters to all MZIs, we repeat the simulation in Supplementary Fig. 22(a). As seen in Supplementary Fig. 23, with external phase shifters, (1) the accuracy is enhanced for insufficient column numbers (e.g., 4, 8); (2) the accuracy reaches its maximum with fewer columns; (3) the maximum (saturated) accuracy shows no increase compared to that without external phase shifters in Supplementary Fig. 22(a). The slight decrease in accuracy observed after 24 columns is not a true decrease, since the larger total parameter number needs more epochs to achieve better convergence, while all simulations here were fixed to 100 epochs.
Supplementary Fig. 23: Accuracy in relation to the MZI layer number (column number) of the VMM with external phase shifters for all constituent MZIs. This simulation is the same as that in Supplementary Fig. 22(a) except that the external phases are added as training parameters. Total parameters are indicated on the top axis. The schematic at the right depicts the MZI with and without the external phase shifter.

BFO vs RMSprop for training four tasks
The experimental comparison between these two algorithms (BFO and forward propagation using RMSprop optimizers) has been discussed in the main text and in Supplementary Sections 3 and 4 above. Here we compare them by simulation for all four tasks: XOR, Iris, Iris with drop out, and MNIST. The simulation was done in a similar way to the experiment. The device was treated as a black box, for which the algorithm did not know any intermediate parameters inside, as when running the algorithm on a real chip. The algorithm used the same code as in the experiment. The only difference is that the phase is the training parameter in simulation, whereas the voltage is the training parameter in experiment. Thus, the algorithm in simulation can be implemented in exactly the same way in experiment without additional work such as preparing an optical error vector or chip calibration [8]. The PPC device used here for XOR and Iris is the same as that used in simulating the nonlinear dataset classification in Fig. 6 in the main text. All training curves, including the accuracy and MSE, are summarized in Supplementary Fig. 24. For XOR in Supplementary Fig. 24(a) and Iris in Supplementary Fig. 24

Benchmark comparison to a previous paper
We summarize all benchmarks in this work together with those demonstrated for a complex photonic neural network in a previous paper.
Supplementary Fig. 25: Comparison between this work and a previous paper for various benchmarks. For Iris, in our work, the experimental testing accuracies of 98.3% and 96.7% are obtained with and without drop out, respectively.

BFO source code and videos of BFO algorithm
Notes: (1) The BFO engine used in our experiment and simulation is shown in Supplementary Fig. 26. The code is written in Python using PyTorch (https://pytorch.org/) and other open packages.
(2) Three Supplementary Videos of the BFO algorithm are provided to illustrate how it works.